Online-Academy
Look, Read, Understand, Apply

Data Mining And Data Warehousing

K-means Clustering Algorithm

K-means is a popular unsupervised learning algorithm used for clustering. It partitions a dataset into k distinct, non-overlapping groups (clusters) based on similarity, where k is the number of clusters chosen in advance.
How K-means Works
Consider a dataset with n data points and a desired number of clusters k:
  1. Initialize: choose k cluster centroids randomly.
  2. Assignment step: assign each data point to the nearest centroid (based on Euclidean distance).
  3. Update step: recalculate each centroid as the mean of all points assigned to its cluster.
  4. Repeat: steps 2 and 3 are repeated until either:
     • the centroids no longer move significantly (convergence), or
     • a maximum number of iterations is reached.
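The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than a reference implementation. The function name kmeans, the helper nearest_centroid, the parameters max_iters, tol and seed, and the sample data are all assumptions made for this example. The sketch also assumes that the initial centroids are drawn from the data points themselves and that an empty cluster simply keeps its previous centroid.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Return, for each point, the index of its closest centroid (Euclidean)."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Cluster `points` (shape n x d) into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Initialize: pick k distinct data points at random as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = nearest_centroid(points, centroids)

        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Convergence check: stop once no centroid moved more than `tol`.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment so the labels match the returned centroids.
    return centroids, nearest_centroid(points, centroids)


# Two well-separated groups of 2-D points (made-up sample data).
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

On this sample data the first three points should end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initialization.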
Advantages
  • Fast and efficient for large datasets: the cost of each iteration grows linearly with the number of data points, clusters, and features.
  • Easy to implement and interpret.
  • Works well with spherical, well-separated clusters.
Limitations
  • Must specify k beforehand.
  • Sensitive to outliers and initial centroids.
  • Assumes clusters are isotropic (uniform in all directions) and equally sized.
  • Poor performance on non-convex clusters or clusters of different densities.
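The sensitivity to outliers follows directly from the update step, because a mean is pulled toward extreme values. The short sketch below illustrates this with made-up numbers. The arrays cluster and outlier are hypothetical data chosen for this example.

```python
import numpy as np

# Four points that sit close together, plus one distant outlier (made-up data).
cluster = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [1.0, 1.1]])
outlier = np.array([[10.0, 10.0]])

# Centroid of the tight group alone: approximately (1.0, 1.0).
centroid_clean = cluster.mean(axis=0)

# Centroid once the outlier is assigned to the same cluster: approximately (2.8, 2.8).
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)

# Distance the centroid moved because of a single point.
shift = np.linalg.norm(centroid_with_outlier - centroid_clean)
print(centroid_clean, centroid_with_outlier, shift)
```

A single outlier moves the centroid well outside the group of points it is supposed to represent, which in turn can change how nearby points are assigned in the next iteration.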